I.  INTRODUCTION

The principle of minimum cross-entropy provides a general method of
inference about an unknown probability density $q^\dagger$ when there exists
a prior estimate of $q^\dagger$ and new information about $q^\dagger$ in the
form of constraints on expected values. The principle states that, of all the
densities that satisfy the constraints, one should choose the posterior q with
the least cross-entropy $H[q,p] = \int dx\, q(x) \log(q(x)/p(x))$, where p is
a prior estimate of $q^\dagger$. Cross-entropy minimization was first
introduced by Kullback [1], who called it minimum directed divergence and
minimum discrimination information. The principle of maximum entropy [2], [3]
is equivalent to cross-entropy minimization in the special case of discrete
spaces and uniform priors.

It is useful and convenient to view cross-entropy minimization as one
implementation of an abstract information operator $\circ$ that takes two
arguments, a prior and new information, and yields a posterior. Thus, we write
the posterior q as $q = p \circ I$, where I stands for the known constraints
on expected values. Recent work has shown that, if the operator $\circ$ is
required to satisfy certain axioms of consistent inference, and if $\circ$ is
implemented by means of functional minimization, then the principle of minimum
cross-entropy follows necessarily [4].

Cross-entropy minimization satisfies a variety of interesting and useful
properties beyond those expressed or implied by the axioms in [4]. Some of
these just reflect well-known properties of cross-entropy [1], [5], but there
are surprising differences as well. For example, cross-entropy does not in
general satisfy a triangle relation involving three arbitrary probability
densities. But in certain important cases involving densities that result
from cross-entropy minimization, cross-entropy satisfies triangle inequalities
and triangle equalities. (See Properties 10, 12, and 13.)

Note: Manuscript submitted January 23, 1980.


It is the purpose of the present paper to state and prove various fundamental
properties of cross-entropy minimization. For completeness, we also restate
the axioms from [4]. After introducing necessary definitions and notation in
Section  II,  we  consider  first  properties  that  are  valid  for  both  equality  and 
inequality  constraints  on  expected  values  (Section  III)  and  then  properties 
that  are  valid  only  for  equality  constraints  (Section  IV).  We  conclude  with  a 
brief  discussion  in  Section  V.  We  also  include  an  Appendix  in  which  we 
discuss  general  analytic  and  computational  methods  of  finding  minimum 
cross-entropy  posteriors. 


II.  DEFINITIONS  AND  NOTATION 

In this section, we introduce the same notation as in [4, Section II].
The discussion here places somewhat greater emphasis on mathematical questions
relating to the existence of minimum-cross-entropy solutions. (See also the
discussion following Property 1.)

We use lower-case boldface Roman letters for system states, which may be
multidimensional, and upper-case boldface Roman letters for sets of system
states. We use lower-case Roman letters for probability densities, and
upper-case script letters for sets of probability densities. Thus, let x be a
state of some system that has a set D of possible states. Let $\mathcal{D}$ be
the set of all probability densities q on D such that $q(x) \ge 0$ for
$x \in D$ and

    $$ \int_D dx\, q(x) = 1 . $$    (1)

We use a dagger ($\dagger$) to distinguish the system's unknown "true" state
probability density $q^\dagger \in \mathcal{D}$. When $S \subseteq D$ is some
set of states, we write $q(x \in S)$ for the set of values q(x) with
$x \in S$.

New information takes the form of linear equality constraints

    $$ \int_D dx\, q^\dagger(x)\, a_k(x) = \bar{a}_k $$    (2)

and inequality constraints

    $$ \int_D dx\, q^\dagger(x)\, c_k(x) \ge \bar{c}_k $$    (3)

for known sets of functions $a_k$, $c_k$, and known values $\bar{a}_k$,
$\bar{c}_k$. The probability densities that satisfy such constraints always
comprise a convex subset $\mathcal{I}$ of $\mathcal{D}$. (A set $\mathcal{I}$
is convex if, given $0 \le A \le 1$ and $q, r \in \mathcal{I}$, it contains
the weighted average $Aq + (1-A)r$.) We refer to the functions $a_k$, $c_k$
as constraint functions and to $\mathcal{I}$ as a constraint set. For a given
constraint set there may of course be more than one set of constraint
functions in terms of which it may be defined. We frequently suppress mention
of a particular set of constraint functions, using the notation
$I = (q^\dagger \in \mathcal{I})$ to mean that $q^\dagger$ is a member of the
constraint set $\mathcal{I}$ and referring to I as a constraint. We use
upper-case Roman letters for constraints.

Let p be some prior density that is an estimate of $q^\dagger$ obtained, by
any means, prior to learning I. We require that priors be strictly positive:

    $$ p(x \in D) > 0 . $$    (4)

(This restriction is discussed below.) Given a prior p and new information I,
the posterior density $q \in \mathcal{I}$ that results from taking I into
account is chosen by minimizing the cross-entropy H[q,p] in the constraint
set $\mathcal{I}$:

    $$ H[q,p] = \min_{q' \in \mathcal{I}} H[q',p] , $$    (5)
where

    $$ H[q,p] = \int_D dx\, q(x) \log\frac{q(x)}{p(x)} . $$    (6)

We introduce an "information operator" $\circ$ that expresses (5) using the
notation

    $$ q = p \circ I . $$    (7)

The operator $\circ$ takes two arguments, a prior and new information, and
yields a posterior.
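For a discrete state space and equality constraints only, (5)-(7) can be
sketched numerically. The code below is a minimal illustration, not the
computational method of the Appendix: it assumes the exponential form of the
posterior referred to later as (A.4) and finds the multipliers by damped
Newton steps on the convex dual function. The names `cross_entropy` and
`min_cross_entropy`, the three-state prior, and the target mean 1.2 are all
hypothetical choices made for the example.

```python
import numpy as np

def cross_entropy(q, p):
    """H[q,p] = sum_x q(x) log(q(x)/p(x)), with 0 log 0 taken as 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

def min_cross_entropy(p, A, b, iters=200, tol=1e-12):
    """Discrete posterior q = p o I for constraints A @ q = b (and sum(q) = 1).

    Assumes q(x) is proportional to p(x) exp(-sum_k lam_k a_k(x)) and solves
    for lam by damped Newton iteration on the dual function.
    """
    p = np.asarray(p, float)
    A = np.atleast_2d(np.asarray(A, float))
    b = np.atleast_1d(np.asarray(b, float))
    lam = np.zeros(len(b))

    def dual(l):
        return np.log(np.sum(p * np.exp(-l @ A))) + l @ b

    for _ in range(iters):
        w = p * np.exp(-lam @ A)
        q = w / w.sum()
        mean = A @ q
        grad = b - mean                              # gradient of the dual
        if np.max(np.abs(grad)) < tol:
            break
        cov = (A * q) @ A.T - np.outer(mean, mean)   # Hessian of the dual
        step = -np.linalg.lstsq(cov, grad, rcond=None)[0]
        t = 1.0
        while dual(lam + t * step) > dual(lam) + 1e-12 and t > 1e-8:
            t *= 0.5                                 # backtrack if no decrease
        lam = lam + t * step
    w = p * np.exp(-lam @ A)
    return w / w.sum()

# Hypothetical example: three states with values 0, 1, 2 and known mean 1.2.
values = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])
q = min_cross_entropy(p, [values], [1.2])
r = np.array([0.2, 0.4, 0.4])        # another density with mean 1.2
```

Here `q` satisfies the mean constraint, and `cross_entropy(q, p)` does not
exceed `cross_entropy(r, p)` for the other feasible density `r`.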

For some subset $S \subseteq D$ of states and $x \in S$, let

    $$ q(x \mid x \in S) = q(x) \Big/ \int_S dx'\, q(x') $$    (8)

be the conditional density, given $x \in S$, corresponding to any
$q \in \mathcal{D}$. We use

    $$ q(x \mid x \in S) = q * S $$    (9)

as a shorthand notation for (8).


In making the restriction (4) we assume that D is the set of states that
are possible according to prior information. We do not impose a similar
restriction on the posterior $q = p \circ I$ since I may rule out states
currently thought to be possible. If this happens, then D must be redefined
before q is used as a prior in a further application of $\circ$. The
restriction (4) does not significantly restrict our results, but it does help
in avoiding certain technical problems that would otherwise result from
division by p(x). For more discussion, see [5].


When D is a discrete set of system states, densities are replaced by
discrete distributions and integrals by sums in the usual way. In a more
general setting for the discussion than we have chosen, D would be a
measurable space, and p and q would be replaced by prior and posterior
probability measures. By continuing to write in terms of probability
densities, we would then be implicitly assuming some underlying measure with
respect to which the rest were absolutely continuous. Indeed such a measure
certainly exists if we demand that no event with zero prior probability can
have positive posterior probability, which in the present context we are in
effect demanding by assuming (4).


III.  PROPERTIES  GIVEN  GENERAL  CONSTRAINTS 


This Section concerns properties that apply in the case of both equality
and inequality constraints (2)-(3). We follow the formal statement of each
property with a brief discussion and then a proof or an appropriate
reference. Throughout, we assume a system with possible states D, probability
density $q^\dagger \in \mathcal{D}$, an arbitrary prior $p \in \mathcal{D}$,
and arbitrary new information $I = (q^\dagger \in \mathcal{I})$, where
$\mathcal{I} \subseteq \mathcal{D}$ contains at least one density q such that
$H[q,p] < \infty$.

Property 1 (Uniqueness): The posterior $q = p \circ I$ is unique.

Discussion: A solution to the cross-entropy minimization problem, if one
exists, is unique provided only that H[q,p] is not identically infinite as q
ranges over the constraint set $\mathcal{I}$. To guarantee that a solution
exists, a little more is required. One condition that suffices for existence
is that, in addition to containing a density q with finite cross-entropy, the
constraint set $\mathcal{I}$ be closed. (We call $\mathcal{I}$ closed if it
contains every probability density q that is a limit of densities
$q_i \in \mathcal{I}$. Limits are taken in the sense that $q_i \to q$ means
$\int |q_i(x) - q(x)|\, dx \to 0$.) For $\mathcal{I}$ to be closed, it
suffices in turn that the constraint functions be bounded. (And conversely,
any closed convex set of probability densities can be defined by equality and
inequality constraints (2), (3) with bounded constraint functions, except
that infinitely many may be required.) It is also possible to assert
existence of $p \circ I$ under less stringent conditions, which do not imply
that $\mathcal{I}$ is closed; see Theorem 3.3 in [6] and Appendix A. This is
fortunate, since a number of examples of practical importance involve
unbounded constraint functions.

Proof of 1: See [6], [4, Section IV.E].

Property 2: The posterior satisfies $q = p \circ I = p$ if and only if the
prior satisfies $p \in \mathcal{I}$.

Discussion: If one views cross-entropy minimization as an inference procedure,
it makes sense that the posterior should be unchanged from the prior if the
new information doesn't contradict the prior in any way. Consider the example
of (A.10)-(A.12). If $a_k = \bar{x}_k$ for k = 1, ..., n, then q(x) = p(x).

Proof of 2: Property 2 follows directly from the property of cross-entropy
that $H[q,p] \ge 0$ with H[q,p] = 0 only if q = p ([1, p. 14]).

Property 3 (idempotence): $(p \circ I) \circ I = p \circ I$.

Discussion: Taking the same information into account twice has the same
effect as taking it into account once.

Proof of 3: Since $(p \circ I) \in \mathcal{I}$, idempotence follows from
Property 2.

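Property 4: Let $I_1$ be the information $I_1 = (q^\dagger \in \mathcal{I}_1)$
and let $I_2$ be the information $I_2 = (q^\dagger \in \mathcal{I}_2)$, for
overlapping constraint sets $\mathcal{I}_1, \mathcal{I}_2 \subseteq
\mathcal{D}$. If $(p \circ I_1) \in \mathcal{I}_2$ holds, then

    $$ p \circ I_1 = (p \circ I_1) \circ (I_1 \wedge I_2)
                   = (p \circ I_1) \circ I_2
                   = p \circ (I_1 \wedge I_2) $$    (10)

holds.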

Discussion: If the result of taking information $I_1$ into account already
satisfies constraints imposed by additional information $I_2$, taking $I_2$
into account in various ways has no effect. For example, let $I_1$ and $I_2$
be the constraints

    $$ \int_0^\infty dx\, x\, q^\dagger(x) = a $$

and

    $$ \int_0^\infty dx\, x^2\, q^\dagger(x) = 2a^2 , $$    (11)

respectively. For an exponential prior $p(x) = r \exp(-rx)$, the posterior
given $I_1$ is $q = p \circ I_1 = (1/a)\exp(-x/a)$ (see (A.10)-(A.12)). The
second moment of q is just $2a^2$, so that q satisfies $q \in \mathcal{I}_2$,
as well as $q = q \circ (I_1 \wedge I_2)$, $q = q \circ I_2$, and
$q = p \circ (I_1 \wedge I_2)$. If the right side of (11) were anything but
$2a^2$, the result of $p \circ (I_1 \wedge I_2)$ would be a truncated Gaussian
or undefined and not an exponential [7, pp. 133-140].

Proof of 4: Since $(p \circ I_1) \in \mathcal{I}_1$ holds and, by assumption,
$(p \circ I_1) \in \mathcal{I}_2$ also holds, it follows that
$(p \circ I_1) \in (\mathcal{I}_1 \cap \mathcal{I}_2)$. The first two
equalities of (10) then follow directly from Properties 2 and 3. The last
equality of (10) follows from $q = p \circ I_1$ having the smallest
cross-entropy H[q,p] of all densities in $\mathcal{I}_1$ and therefore in
$\mathcal{I}_1 \cap \mathcal{I}_2$.

Property 5 (Invariance): Let $\Gamma$ be a coordinate transformation from
$x \in D$ to $y \in D'$ with $(\Gamma q)(y) = J^{-1} q(x)$, where J is the
Jacobian $J = \partial(y)/\partial(x)$. Let $\Gamma\mathcal{D}$ be the set of
densities $\Gamma q$ corresponding to densities $q \in \mathcal{D}$. Let
$(\Gamma\mathcal{I}) \subseteq (\Gamma\mathcal{D})$ correspond to
$\mathcal{I} \subseteq \mathcal{D}$. Then

    $$ (\Gamma p) \circ (\Gamma I) = \Gamma(p \circ I) $$    (12)

and

    $$ H[\Gamma(p \circ I), \Gamma p] = H[p \circ I, p] $$    (13)

hold, where $\Gamma I = ((\Gamma q^\dagger) \in (\Gamma\mathcal{I}))$.


Discussion: Eq. (12) states that the same answer is obtained when one solves
the inference problem in two different coordinate systems, in that the
posteriors in the two systems are related by the coordinate transformation.
Moreover, the cross-entropy between the posteriors and the priors has the same
value in both coordinate systems.

As an example, let $y_1$ and $y_2$ be the real and imaginary parts of a
complex sinusoidal signal; let $x_1$ be the total power
$x_1 = y_1^2 + y_2^2$, and let $x_2$ be the phase, so that

    $$ (y_1, y_2) = \Gamma(x_1, x_2)
                  = (x_1^{1/2}\cos(x_2),\ x_1^{1/2}\sin(x_2)) . $$

Then the Jacobian is constant:

    $$ J = \det \begin{pmatrix}
             \tfrac{1}{2} x_1^{-1/2}\cos(x_2) & -x_1^{1/2}\sin(x_2) \\
             \tfrac{1}{2} x_1^{-1/2}\sin(x_2) &  x_1^{1/2}\cos(x_2)
           \end{pmatrix} = 1/2 . $$

Therefore, if the prior density p(x) is uniform in some region in the x
coordinate space, the transformed prior $(\Gamma p)(y)$ will be uniform on a
corresponding region in the y coordinate space. For example, suppose

    $$ p(x) = \begin{cases}
         1/(2\pi R^2), & (0 \le x_1 \le R^2,\ -\pi < x_2 \le \pi) \\
         0, & \text{otherwise,}
       \end{cases} $$

which makes p uniform in a certain rectangle. Thus, we find that

    $$ (\Gamma p)(y) = \begin{cases}
         1/(\pi R^2), & (y_1^2 + y_2^2 \le R^2) \\
         0, & \text{otherwise,}
       \end{cases} $$

which makes $\Gamma p$ uniform on a certain disk. (Notice
$1/(\pi R^2) = J^{-1}(1/(2\pi R^2))$.)

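Now, let new information I specify the expected power

    $$ \int_0^{R^2} dx_1 \int_{-\pi}^{\pi} dx_2\, x_1\, q^\dagger(x) = P . $$

The resulting posterior $q = p \circ I$ is exponential with respect to $x_1$:

    $$ q(x) = \begin{cases}
         A \exp[-\lambda x_1], & (0 \le x_1 \le R^2,\ -\pi < x_2 \le \pi) \\
         0, & \text{otherwise}
       \end{cases} $$

for certain constants A and $\lambda$. The new information in the transformed
coordinates, $\Gamma I$, is

    $$ \int dy_1 \int dy_2\, (y_1^2 + y_2^2)\, q'^\dagger(y) = P , $$

and the resulting posterior $q' = (\Gamma p) \circ (\Gamma I)$ has the form of
a bivariate Gaussian inside the disk:

    $$ q'(y) = \begin{cases}
         2A \exp[-\lambda(y_1^2 + y_2^2)], & (y_1^2 + y_2^2 \le R^2) \\
         0, & \text{otherwise.}
       \end{cases} $$

The two posteriors q and q' are related by $q'(y) = J^{-1} q(x)$, that is,
$q' = \Gamma q$, as asserted in (12).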

Proof of 5: See [4, Section IV.E]. The proof of (12) follows directly from
the fact that cross-entropy is transformation invariant. Eq. (13) is just a
special case of this invariance.
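The constant Jacobian in the example above can be checked numerically. The
following sketch estimates the determinant of $\partial(y)/\partial(x)$ by
central differences at a few arbitrary points and confirms the relation
between the two uniform priors. The function names, the step size `h`, the
sample points, and the value R = 2 are illustrative assumptions.

```python
import numpy as np

def gamma(x1, x2):
    """Map (power, phase) to (real part, imaginary part)."""
    return np.array([np.sqrt(x1) * np.cos(x2), np.sqrt(x1) * np.sin(x2)])

def jacobian_det(x1, x2, h=1e-6):
    """Central-difference estimate of det d(y1, y2)/d(x1, x2)."""
    d1 = (gamma(x1 + h, x2) - gamma(x1 - h, x2)) / (2 * h)
    d2 = (gamma(x1, x2 + h) - gamma(x1, x2 - h)) / (2 * h)
    return np.linalg.det(np.column_stack([d1, d2]))

rng = np.random.default_rng(0)
points = [(rng.uniform(0.1, 4.0), rng.uniform(-np.pi, np.pi)) for _ in range(5)]
dets = [jacobian_det(x1, x2) for x1, x2 in points]

R = 2.0
p_x = 1.0 / (2 * np.pi * R**2)   # uniform prior density on the rectangle
p_y = p_x / 0.5                  # (Gamma p)(y) = J^{-1} p(x) with J = 1/2
```

Every entry of `dets` is close to 1/2, and `p_y` equals $1/(\pi R^2)$, the
uniform density on the disk of radius R.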

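Property 6 (System Independence): Let there be two systems, with sets
$D_1$ and $D_2$ of states and probability densities of states
$q_1^\dagger \in \mathcal{D}_1$ and $q_2^\dagger \in \mathcal{D}_2$. Let
$p_1 \in \mathcal{D}_1$ and $p_2 \in \mathcal{D}_2$ be prior densities. Let
$I_1 = (q_1^\dagger \in \mathcal{I}_1)$ and
$I_2 = (q_2^\dagger \in \mathcal{I}_2)$ be new information about the two
systems, where $\mathcal{I}_1 \subseteq \mathcal{D}_1$ and
$\mathcal{I}_2 \subseteq \mathcal{D}_2$. Then

    $$ (p_1 p_2) \circ (I_1 \wedge I_2) = (p_1 \circ I_1)(p_2 \circ I_2) $$    (14)

and

    $$ H[q_1 q_2, p_1 p_2] = H[q_1, p_1] + H[q_2, p_2] $$    (15)

hold, where $q_1 = p_1 \circ I_1$ and $q_2 = p_2 \circ I_2$.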

Discussion: Property 6 states that it doesn't matter whether one accounts for
independent information about two systems separately or together in terms of a
joint density. Whether or not the two systems are in fact independent is
irrelevant: The property applies as long as there are independent priors and
independent new information. Examples can easily be generated from the
multivariate exponential and multivariate Gaussian examples in the Appendix.

Proof of 6: See [4, Section IV.E].

Property 7 (subset independence): Let $S_1, \ldots, S_n$ be disjoint sets
whose union is D. Let the new information I comprise information about
the conditional densities $q^\dagger * S_i$. Thus,
$I = I_1 \wedge I_2 \wedge \cdots \wedge I_n$, and
$I_i = ((q^\dagger * S_i) \in \mathcal{I}_i)$, where
$\mathcal{I}_i \subseteq \mathcal{S}_i$ and $\mathcal{S}_i$ is the set of
densities on $S_i$. Let $M = (q^\dagger \in \mathcal{M})$ be new information
giving the probability of being in each of the n subsets, where $\mathcal{M}$
is the set of densities q that satisfy

    $$ \int_{S_i} dx\, q(x) = m_i $$

for each subset $S_i$, where the $m_i$ are known values. Then

    $$ (p \circ (I \wedge M)) * S_i = (p * S_i) \circ I_i $$    (16)

and

    $$ H[p \circ (I \wedge M), p] = \sum_i m_i H[q_i, p_i]
                                   + \sum_i m_i \log\frac{m_i}{s_i} $$    (17)

hold, where $p_i = p * S_i$, $q_i = p_i \circ I_i$, and the $s_i$ are the prior
probabilities of being in each subset,

    $$ s_i = \int_{S_i} dx\, p(x) . $$    (18)


Discussion: This property concerns situations in which the set of states D
decomposes naturally into disjoint subsets $S_i$, in which the new information
$I = I_1 \wedge I_2 \wedge \cdots \wedge I_n$ comprises disjoint information
about the conditional probability densities $q^\dagger * S_i$ in each subset,
and in which there is also new information M giving the total probability
$m_i$ of being in each subset. Given this information, there are two ways to
obtain posterior conditional densities for each subset: One way is to obtain a
conditional posterior $(p * S_i) \circ I_i$ from each conditional prior
$p * S_i$. Another way is to obtain a posterior $q = p \circ (I \wedge M)$ for
the whole system and then to compute a conditional posterior $q * S_i$.
Property 7 states that the results are the same in both cases: it doesn't
matter whether one treats an independent subset of system states in terms of a
separate conditional density or in terms of the full system density.

To illustrate Property 7, suppose that a six-sided die was rolled a large
number of times. The frequencies with which the different die faces turned up
were not recorded individually, but the mean number of spots showing was
determined separately for the odd results and for the even results. There was
no prior reason to expect any face of the die to turn up more often than any
other. Indeed, the probability for the number of spots showing to be odd
turned out to be .5. However, the mean number of spots showing, given that
the number was odd, was found to be 4; the mean number of spots showing, given
that the number was even, also was found to be 4. Given this information, we
are asked to estimate the probability for each face of the die to turn up, as
well as the conditional probability given whether the face is odd or even.

Let $S_1 = \{1,3,5\}$ and $S_2 = \{2,4,6\}$. We will first solve the problem
on $S_1$ and $S_2$ separately and then solve it on $S_1 \cup S_2$.

In all cases, the prior is uniform. The prior $p_1$ on $S_1$ is
$p_1(1) = p_1(3) = p_1(5) = 1/3$. The information $I_1$ giving the expected
value for an odd number of spots is

    $$ \sum_{n \in S_1} n\, (q^\dagger * S_1)(n) = 4 ; $$

therefore, we compute a posterior $q_1 = p_1 \circ I_1$ on $S_1$ by minimizing
$H[q_1, p_1]$ subject to $q_1(1) + 3q_1(3) + 5q_1(5) = 4$. The result is

    $$ q_1(1) = 0.1162, \quad q_1(3) = 0.2676, \quad q_1(5) = 0.6162 . $$    (19)

Similarly, the prior $p_2$ on $S_2$ is $p_2(2) = p_2(4) = p_2(6) = 1/3$,
the posterior $q_2$ is subject to the constraint $I_2$,

    $$ 2q_2(2) + 4q_2(4) + 6q_2(6) = 4 , $$

and the result of minimizing $H[q_2, p_2]$ is

    $$ q_2(2) = 1/3, \quad q_2(4) = 1/3, \quad q_2(6) = 1/3 . $$    (20)

On $S_1 \cup S_2$, the prior p is p(1) = p(2) = ... = p(6) = 1/6. The
information $I_1$, which concerns $q^\dagger * S_1$, may be expressed as

    $$ q^\dagger(1) + 3q^\dagger(3) + 5q^\dagger(5)
         = 4(q^\dagger(1) + q^\dagger(3) + q^\dagger(5)) . $$

We therefore subject the posterior q to the constraint

    $$ -3q(1) - q(3) + q(5) = 0 . $$    (21)

Similarly, because of $I_2$, we have the constraint

    $$ -2q(2) + 2q(6) = 0 . $$    (22)

Finally, because of the information M, we subject q to the constraint

    $$ q(1) - q(2) + q(3) - q(4) + q(5) - q(6) = 0 , $$    (23)

since this is equivalent to q(1) + q(3) + q(5) = .5 = q(2) + q(4) + q(6).
Upon minimizing H[q,p] subject to the constraints (21)-(23), we find that
$q = p \circ (I_1 \wedge I_2 \wedge M)$ is given by

    q(1) = 0.0581,    q(2) = 1/6,
    q(3) = 0.1338,    q(4) = 1/6,        (24)
    q(5) = 0.3081,    q(6) = 1/6 .

To find the conditional probabilities $q * S_1$ and $q * S_2$, we divide both
columns in this result by .5; the results agree with $q_1$ and $q_2$ as
computed above ((19), (20)), and as stated in (16).

Proof of 7: See [4, Section IV.E].
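The die example can be reproduced numerically. The sketch below solves the
conditional problem on $S_1$ and the full problem with constraints (21)-(23),
then compares the conditional of the full posterior with $q_1$. The solver is
a hypothetical helper (damped Newton iteration on the dual of the
exponential-form posterior), used here only as one convenient way to carry
out the minimization; it is not the procedure of the Appendix.

```python
import numpy as np

def min_cross_entropy(p, A, b, iters=200, tol=1e-12):
    """Discrete posterior q = p o I for constraints A @ q = b (and sum(q) = 1)."""
    p = np.asarray(p, float)
    A = np.atleast_2d(np.asarray(A, float))
    b = np.atleast_1d(np.asarray(b, float))
    lam = np.zeros(len(b))

    def dual(l):
        return np.log(np.sum(p * np.exp(-l @ A))) + l @ b

    for _ in range(iters):
        w = p * np.exp(-lam @ A)
        q = w / w.sum()
        mean = A @ q
        grad = b - mean
        if np.max(np.abs(grad)) < tol:
            break
        cov = (A * q) @ A.T - np.outer(mean, mean)
        step = -np.linalg.lstsq(cov, grad, rcond=None)[0]
        t = 1.0
        while dual(lam + t * step) > dual(lam) + 1e-12 and t > 1e-8:
            t *= 0.5
        lam = lam + t * step
    w = p * np.exp(-lam @ A)
    return w / w.sum()

faces = np.arange(1, 7)
odd = faces % 2 == 1

# Conditional problem on S_1 = {1, 3, 5}: uniform prior, mean number of spots 4.
q1 = min_cross_entropy(np.full(3, 1 / 3), [[1, 3, 5]], [4])

# Full problem on faces 1..6: constraints (21), (22), (23), right sides all 0.
A = [[-3, 0, -1, 0, 1, 0],
     [0, -2, 0, 0, 0, 2],
     [1, -1, 1, -1, 1, -1]]
q = min_cross_entropy(np.full(6, 1 / 6), A, [0, 0, 0])
q_given_odd = q[odd] / q[odd].sum()      # the conditional density q * S_1
```

To four decimals, `q1` matches (19) and `q` matches (24), and `q_given_odd`
agrees with `q1`, which is the content of (16) for this example.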

Property 8 (weak subset independence): For the same definitions and
notation as Property 7,

    $$ (p \circ I) * S_i = (p * S_i) \circ I_i $$    (25)

and

    $$ H[p \circ I, p] = \sum_i r_i H[q_i, p_i]
                        + \sum_i r_i \log\frac{r_i}{s_i} $$    (26)

hold, where $p_i = p * S_i$, $q_i = p_i \circ I_i$, the $s_i$ are the prior
probabilities of being in each subset (18), and the $r_i$ are the posterior
probabilities of being in each subset,

    $$ r_i = \int_{S_i} dx\, q(x) , $$    (27)

for $q = p \circ I$.

Discussion: This property states that the two ways of obtaining the posterior
conditional densities also lead to the same result in the case when one does
not have information giving the total probability in each subset. Results for
the full system posterior, however, will not in general be the same for the
cases covered by Properties 7 and 8. That is, $p \circ I$ and
$p \circ (I \wedge M)$ will not in general be equal.

To illustrate Property 8, we solve the example problem from Property 7,
omitting the information M that the probability of an odd (or of an even)
number of spots is .5. The separate solutions on $S_1$ and $S_2$ proceed
exactly as before and yield the same posteriors $q_1$ and $q_2$. The solution
on $S_1 \cup S_2$ differs from the previous one only in that we minimize H[q,p]
subject to the constraints (21) and (22), but not subject to (23). The
result, $q' = p \circ (I_1 \wedge I_2)$, is given by

    q'(1) = 0.0524,    q'(2) = 0.1831,
    q'(3) = 0.1206,    q'(4) = 0.1831,
    q'(5) = 0.2778,    q'(6) = 0.1831,

and differs from the previous result (24). Moreover, the subset probabilities
$r_1$ and $r_2$ do not satisfy M: summing the two columns gives $r_1 = 0.4508$
and $r_2 = 0.5492$. However, dividing the two columns respectively by $r_1$
and $r_2$ gives the same conditional probabilities as before:
$q' * S_1 = q_1$ and $q' * S_2 = q_2$ (see (19), (20)).

Proof of 8: For $q = p \circ I$, let $r_i$ be given by (27). Then let R be
information $R = (q^\dagger \in \mathcal{R})$, where $\mathcal{R}$ is the set
of densities satisfying (27). It follows from Property 4 that
$p \circ I = p \circ (I \wedge R)$ holds; (25) and (26) then follow from
Property 7.
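The subset probabilities in the example can also be recovered from (26)
alone. Taking the conditional posteriors $q_1$ and $q_2$ as given by (25), the
only remaining unknown is $r_1$, and the right side of (26) is a convex
function of it. The sketch below minimizes that function by ternary search;
the closed form used for $q_1$ (weights 1, t, $t^2$ with $t^2 - t - 3 = 0$)
is our own reduction of the mean constraint on $S_1$, and all names are
hypothetical.

```python
import numpy as np

def cross_entropy(q, p):
    """H[q,p] = sum_x q(x) log(q(x)/p(x)), with 0 log 0 taken as 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

# q1 is proportional to (1, t, t^2) on {1, 3, 5}; the constraint
# q1(1) + 3 q1(3) + 5 q1(5) = 4 then reduces to t^2 - t - 3 = 0.
t = (1 + np.sqrt(13)) / 2
q1 = np.array([1.0, t, t**2]) / (1 + t + t**2)
q2 = np.full(3, 1 / 3)                 # (20): the uniform prior is unchanged
p1 = p2 = np.full(3, 1 / 3)
s = np.array([0.5, 0.5])               # prior subset probabilities (18)
H = np.array([cross_entropy(q1, p1), cross_entropy(q2, p2)])

def total(r1):
    """Right side of (26) as a function of the posterior probability of S_1."""
    r = np.array([r1, 1 - r1])
    return float(np.sum(r * H) + np.sum(r * np.log(r / s)))

# Ternary search for the minimizer of the convex function `total` on (0, 1).
lo, hi = 1e-9, 1 - 1e-9
for _ in range(200):
    a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if total(a) < total(b):
        hi = b
    else:
        lo = a
r1 = (lo + hi) / 2
q_full = np.concatenate([r1 * q1, (1 - r1) * q2])   # faces 1, 3, 5, then 2, 4, 6
```

The minimizer `r1` agrees with the value 0.4508 quoted above, and `q_full`
reproduces the tabulated q' to four decimals.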

Property 9 (subset aggregation): Let $S_1, S_2, \ldots, S_n$ be disjoint
sets whose union is D. Let $\Psi$ be a transformation such that, for any
$q \in \mathcal{D}$, $q' = \Psi q$ is a discrete distribution with

    $$ q'(x_i) = \int_{S_i} dx\, q(x) , $$

where $x_i$ is a discrete state corresponding to $x \in S_i$. Thus the
transformation $\Psi$ aggregates the states in each subset $S_i$. Suppose new
information $I' = ((\Psi q^\dagger) \in \mathcal{I}')$ is obtained about the
aggregate distribution $\Psi q^\dagger$, where $\mathcal{I}'$ is a convex set
of discrete distributions. Then for any prior $p \in \mathcal{D}$,

    $$ p * S_i = (p \circ I) * S_i , $$    (28)

    $$ (\Psi p) \circ I' = \Psi(p \circ I) , $$    (29)

and

    $$ H[\Psi(p \circ I), \Psi p] = H[p \circ I, p] $$    (30)

all hold, where $I = \Psi^{-1} I'$ is the information I' expressed in terms of
$q^\dagger$ instead of in terms of $\Psi q^\dagger$. (That is,
$I = (q^\dagger \in (\Psi^{-1}\mathcal{I}'))$, where
$\Psi^{-1}\mathcal{I}'$ are the densities q such that
$(\Psi q) \in \mathcal{I}'$.)


Discussion: Note that (29) and (30), in which $\Psi$ is a many-to-one mapping,
have the same form as the invariance property, which holds for one-to-one
coordinate transformations $\Gamma$ (see (12)-(13)). Indeed, both invariance
and subset aggregation can be viewed as special cases of a more general,
measure-theoretic invariance. In mathematical terms, the operator $\circ$ is
functorial.

Proof of 9: Let the information I' be a set of known expectations
$\sum_i g_{ki}\, q'^\dagger(x_i)$, for k = 1, ..., m, or bounds on these
expectations, where $q'^\dagger = \Psi q^\dagger$. In terms of $q^\dagger$
this becomes a set of known or bounded expectations

    $$ \int_D dx\, q^\dagger(x)\, f_k(x) , $$

where $f_k(x \in S_i) = g_{ki}$ is constant in each subset $S_i$. The posterior
$q = p \circ I$ has the form

    $$ q(x) = p(x) \exp\Big( -\lambda_0 - \sum_k \lambda_k f_k(x) \Big) , $$    (31)

where some of the terms in the summation over k may be omitted in the case of
inequality constraints (see (A.4)). Since $f_k$ is constant on each subset,
(31) has the form $q(x \in S_i) = A_i\, p(x \in S_i)$, where $A_i$ is a
subset-dependent constant. This proves (28). Now, in general for any
$q, p \in \mathcal{D}$, the cross-entropy H[q,p] can be expressed [4] as

    $$ H[q,p] = \sum_i r_i H[q_i, p_i] + \sum_i r_i \log\frac{r_i}{s_i} , $$    (32)

where $p_i = p * S_i$, $q_i = q * S_i$, and $s_i$ and $r_i$ are the
probabilities of $S_i$ under p and q, respectively. In the present case we
have $q_i = p_i$ from (28). Since $H[q_i, q_i] = 0$, (32) reduces to

    $$ H[q,p] = \sum_i r_i \log\frac{r_i}{s_i} = H[\Psi q, \Psi p] . $$

Minimizing the left side subject to I, yielding $q = p \circ I$, is equivalent
to minimizing the right side subject to I'. This proves (29) and (30).

Property 10 (triangle relations): For any $r \in \mathcal{I}$,

    $$ H[r,p] \ge H[r,q] + H[q,p] , $$    (33)

where $q = p \circ I$. When I is determined by a finite set of equality
constraints only, equality holds in (33).

Discussion: The triangle equality is important for applications in which
cross-entropy minimization is used for purposes of classification and pattern
recognition.

Proof of 10: We have

    $$ H[q,p] = \min_{q' \in \mathcal{I}} H[q',p] . $$

The densities $q' = (1-t)q + tr$ belong to $\mathcal{I}$ for all
$t \in [0,1]$ since $q \in \mathcal{I}$, $r \in \mathcal{I}$, and
$\mathcal{I}$ is convex. For all such t we therefore have

    $$ H[(1-t)q + tr, p] \ge H[q,p] , $$    (34)

or $F(t) \ge F(0)$, where we have written F(t) for the left side of (34). It
follows that $F'(0) \ge 0$ (provided F is differentiable at 0). We therefore
set

    $$ \frac{d}{dt} \int dx\, [(1-t)q(x) + tr(x)]
         \log\frac{(1-t)q(x) + tr(x)}{p(x)} \bigg|_{t=0} \ge 0 $$

and differentiate under the integral sign. (For justification of this step
and the existence of F'(0), see Csiszar [6], who gives the proof in a more
general measure-theoretic setting.) The result is

    $$ \int dx \bigg( [r(x) - q(x)] \log\frac{(1-t)q(x) + tr(x)}{p(x)}
         + [(1-t)q(x) + tr(x)]\,
           \frac{r(x) - q(x)}{(1-t)q(x) + tr(x)} \bigg) \bigg|_{t=0} \ge 0 , $$

or

    $$ \int dx\, [r(x) - q(x)] \Big[ 1 + \log\frac{q(x)}{p(x)} \Big] \ge 0 . $$

This implies

    $$ \int dx\, r(x) \log\frac{q(x)}{p(x)}
         \ge \int dx\, q(x) \log\frac{q(x)}{p(x)} , $$

since $\int dx\, [r(x) - q(x)] = 0$. Therefore,

    $$ \int dx\, r(x) \log\frac{r(x)}{p(x)}
         \ge \int dx\, r(x) \log\frac{r(x)}{q(x)}
           + \int dx\, q(x) \log\frac{q(x)}{p(x)} . $$

Consequently $H[r,p] \ge H[r,q] + H[q,p]$.

Now assume I is determined by finitely many equality constraints. Since
$q = p \circ I$, $\log(q(x)/p(x))$ assumes the form

    $$ \log\frac{q(x)}{p(x)} = -\lambda_0 - \sum_k \lambda_k a_k(x) $$

(cf. (A.4)). But then

    $$ \int dx\, r(x) \log\frac{q(x)}{p(x)}
         = -\lambda_0 - \sum_k \lambda_k \bar{a}_k
         = \int dx\, q(x) \log\frac{q(x)}{p(x)} = H[q,p] , $$

since r and q both satisfy the equality constraints. The equality

    $$ \int dx\, r(x) \log\frac{r(x)}{p(x)}
         = \int dx\, r(x) \log\frac{r(x)}{q(x)}
         + \int dx\, r(x) \log\frac{q(x)}{p(x)} $$

then implies $H[r,p] = H[r,q] + H[q,p]$.
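A small discrete check of (33) may help fix ideas. The sketch below reuses
the setting of a uniform prior on {1, 3, 5} with the single equality
constraint that the mean be 4, for which the posterior has weights
proportional to 1, t, $t^2$ with $t^2 - t - 3 = 0$. The densities `r_eq` and
`r_in` are arbitrary choices of ours. The second one illustrates the
inequality case under a hypothetical constraint "mean at least 4"; because
the prior mean is 3, that constraint is active and yields the same posterior.

```python
import numpy as np

def cross_entropy(q, p):
    """H[q,p] = sum_x q(x) log(q(x)/p(x)), with 0 log 0 taken as 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

n = np.array([1.0, 3.0, 5.0])                   # states
p = np.full(3, 1 / 3)                           # uniform prior
t = (1 + np.sqrt(13)) / 2                       # root of t^2 - t - 3 = 0
q = np.array([1.0, t, t**2]) / (1 + t + t**2)   # q = p o I for mean 4

# r_eq has mean 4, so it satisfies the equality constraint: equality in (33).
r_eq = np.array([0.1, 0.3, 0.6])
gap_eq = cross_entropy(r_eq, p) - cross_entropy(r_eq, q) - cross_entropy(q, p)

# r_in has mean 4.5: it lies inside the hypothetical inequality constraint
# set "mean >= 4" but not on its boundary, so (33) is a strict inequality.
r_in = np.array([0.05, 0.15, 0.8])
gap_in = cross_entropy(r_in, p) - cross_entropy(r_in, q) - cross_entropy(q, p)
```

Up to rounding, `gap_eq` vanishes, while `gap_in` is strictly positive.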


Property 11:

    $$ H[q^\dagger, p \circ I] \le H[q^\dagger, p] $$    (35)

holds with equality if and only if $p \circ I = p$.

Discussion: This property states that the posterior $q = p \circ I$ is always
closer to $q^\dagger$, in the cross-entropy sense, than is the prior p.

Proof of 11: Since $q^\dagger \in \mathcal{I}$ holds, (35) follows directly
from (33) with $r = q^\dagger$.

IV.  PROPERTIES  GIVEN  EQUALITY  CONSTRAINTS 

This  Section  concerns  properties  that  apply  when  some  of  the  new 
information  is  in  the  form  of  equality  constraints  (2)  only.  Throughout,  we 
assume a system with possible states D and an arbitrary prior
$p \in \mathcal{D}$.

Property 12: Let the system have a probability density
$q^\dagger \in \mathcal{D}$ and let there be information
$I = (q^\dagger \in \mathcal{I})$ that is determined by a finite set of
equality constraints only. Then

    $$ H[q^\dagger, p] = H[q^\dagger, q] + H[q,p] $$    (36)

holds, where $q = p \circ I$.

Discussion: Since the difference $H[q^\dagger, p] - H[q^\dagger, q]$ is just
H[q,p], and since H[q,p] is a measure [1] of the information divergence
between q and p, Property 12 shows that $H[p \circ I, p]$ can be interpreted
as the amount of information provided by I that was not already inherent in p.
Stated differently, $H[p \circ I, p]$ is the amount of information-theoretic
distortion introduced if p is used instead of $p \circ I$. Since, for any
prior p and any density $r \in \mathcal{D}$ with $H[r,p] < \infty$, there
exists a finite set of equality constraints $I_r$ such that
$r = p \circ I_r$ (see Appendix B), H[r,p] is in general the amount of
information needed to determine r when given p, or the amount of
information-theoretic distortion introduced if p is used instead of r.

Proof of 12: Eq. (36) follows directly from (33), since
$q^\dagger \in \mathcal{I}$ holds.

Property 13: Let the system have a probability density
$q^\dagger \in \mathcal{D}$ and let there be information
$I_1 = (q^\dagger \in \mathcal{I}_1)$ and information
$I_2 = (q^\dagger \in \mathcal{I}_2)$, where
$\mathcal{I}_1, \mathcal{I}_2 \subseteq \mathcal{D}$ are constraint sets with
a non-empty intersection. Suppose that $\mathcal{I}_1$ is determined by a set
of equality constraints (2) only. Then

    $$ (p \circ I_1) \circ (I_1 \wedge I_2) = p \circ (I_1 \wedge I_2) $$    (37)

and

    $$ H[q,p] = H[q, q_1] + H[q_1, p] $$    (38)

hold, where $q = p \circ (I_1 \wedge I_2)$ and $q_1 = p \circ I_1$.

Discussion: When $I_1$ is determined by equality constraints, (37) holds
whether or not $(p \circ I_1) \in \mathcal{I}_2$ (compare with Property 4).
Property 13 is important for applications in which constraint information
arrives piecemeal, and states that intermediate posteriors can be used as
priors in computing final posteriors without affecting the results. Like (33)
and (36), the triangle equality (38) is important for applications in which
cross-entropy minimization is used for purposes of classification and pattern
recognition.

As an example of Property 13, we consider minimum cross-entropy spectral
analysis [8]. If one describes a stochastic, band-limited, discrete-spectrum
signal in terms of a probability density
$q^\dagger(x) = q^\dagger(x_1, \ldots, x_n)$, where $x_k$ is the energy at
frequency $f_k$, known values of the autocorrelation function can be expressed
as expectations of $q^\dagger$, namely,


where $R_r$ is the autocorrelation sample at lag $t_r$. Let $I_1$ be a limited
set of autocorrelations $R_1, \ldots, R_m$. Then, for a prior $p_w$ with a flat
(white) power spectrum $P_k = \int dx\, x_k\, p_w(x) = P$, the power spectrum
of the posterior $q_{LPC} = p_w \circ I_1$ is just the nth-order maximum
entropy or Linear Predictive Coding (LPC) spectrum [8]. Let $I_2$ be the set of
autocorrelation samples $R_{m+1}, R_{m+2}, \ldots$ that together with $I_1$
fully determine the power spectrum of $q^\dagger$. Then (37) yields
$q = p_w \circ (I_1 \wedge I_2) = q_{LPC} \circ (I_1 \wedge I_2)$.

Proof of 13: The density $q_1$ has the form (A.4),

    $$ q_1(x) = p(x) \exp\Big( -\lambda_0 - \sum_k \lambda_k a_k(x) \Big) . $$

For an arbitrary density $q \in \mathcal{D}$, the cross-entropy with respect
to $q_1$ satisfies

    $$ H[q, q_1] = \int dx\, q(x)
         \log\frac{q(x) \exp(\lambda_0 + \sum_k \lambda_k a_k(x))}{p(x)}
         = H[q,p] + \lambda_0 + \sum_k \lambda_k \int dx\, q(x)\, a_k(x) . $$

If q satisfies $q \in \mathcal{I}_1$, this becomes

    $$ H[q, q_1] = H[q,p] + \lambda_0 + \sum_k \lambda_k \bar{a}_k , $$    (39)

where $\lambda_0$, $\lambda_k$, and $\bar{a}_k$ are constants. Since
$H[q, q_1]$ and H[q,p] differ by a constant on $\mathcal{I}_1$, it follows
that they have the same minima on any subset of $\mathcal{I}_1$. Since
$(\mathcal{I}_1 \cap \mathcal{I}_2) \subseteq \mathcal{I}_1$ holds, this
proves (37). Moreover, (39) and (A.5) yield (38), which is also a special case
of (33).


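Property 14. Suppose there are two underlying probability densities $q_1^\dagger$ and $q_2^\dagger$. Let $I_1$ and $I_2$ stand respectively for the sets of equality constraints

$$\int d\mathbf{x}\, q_1^\dagger(\mathbf{x})\, a_i(\mathbf{x}) = F_i^{(1)} \quad (i = 1, \ldots, m) \tag{40}$$

and

$$\int d\mathbf{x}\, q_2^\dagger(\mathbf{x})\, a_i(\mathbf{x}) = F_i^{(2)} \quad (i = 1, \ldots, s), \tag{41}$$

where $s \geq m$. Then

$$(p \circ I_1) \circ I_2 = p \circ I_2 \tag{42}$$

holds. Moreover, if $\lambda_k^{(1)}$, $\lambda_k^{(12)}$, and $\lambda_k^{(2)}$ are the Lagrangian multipliers associated with $q_1 = p \circ I_1$, $q_{12} = q_1 \circ I_2$, and $q_2 = p \circ I_2$, respectively, then

$$\lambda_k^{(2)} = \lambda_k^{(1)} + \lambda_k^{(12)} \quad (k = 0, \ldots, m), \tag{43}$$

$$\lambda_k^{(2)} = \lambda_k^{(12)} \quad (k = m+1, \ldots, s), \tag{44}$$

and

$$H[q_2,p] = H[q_2,q_1] + H[q_1,p] + \sum_{r=1}^{m} \lambda_r^{(1)} \big(F_r^{(1)} - F_r^{(2)}\big) \tag{45}$$

also hold.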


Discussion: Property 14 can apply to situations in which $q_1^\dagger$ and $q_2^\dagger$ are system probability densities at different times and in which $q_1^\dagger$ or estimates of $q_1^\dagger$ are considered to be good estimates of $q_2^\dagger$. If $I_2$ is determined in part by expectations of the same functions as $I_1$, but with different expected values, then the results of taking $I_1$ into account are completely wiped out by subsequently taking $I_2$ into account. As an example, consider frame-by-frame minimum cross-entropy spectral analysis in which $I_i$ is determined by autocorrelation samples in frame $i$ at a fixed set of lags ($s = m$). Eq. (42) shows that the results for frame $i$ are the same whether the assumed prior is an original prior $p$, the posterior from frame $i-1$, or some intermediate estimate. (However, there may be computational or bandwidth-reduction advantages to using $p \circ I_{i-1}$ as a prior in frame $i$.) Note that, if $s \geq m$ and $F_r^{(1)} = F_r^{(2)}$ for $r = 1, \ldots, m$, Property 14 reduces to Property 13.
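The "wiping out" described above can be checked numerically. The sketch below is an illustration under stated assumptions: a six-state discrete system, a single constraint function $a(x_j) = j$ (so $s = m = 1$), and hypothetical helper names `tilt`, `mean`, `solve_mean`, and `H`. The multiplier is found by bisection, which is a convenient choice for one constraint and is not taken from the source.

```python
import math

def tilt(p, lam):
    """q_j proportional to p_j * exp(-lam * j) for states j = 1..n."""
    w = [pj * math.exp(-lam * j) for j, pj in enumerate(p, start=1)]
    z = sum(w)
    return [v / z for v in w]

def mean(q):
    return sum(j * qj for j, qj in enumerate(q, start=1))

def solve_mean(p, target, lo=-50.0, hi=50.0):
    """Bisect for the multiplier that gives the posterior the target mean."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(p, mid)) > target:  # the mean decreases as lam grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def H(q, p):
    """Cross-entropy H[q, p] = sum_j q_j log(q_j / p_j)."""
    return sum(qj * math.log(qj / pj) for qj, pj in zip(q, p))

p = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]
F1, F2 = 2.5, 4.0          # expected values in I_1 and I_2

lam1 = solve_mean(p, F1)
q1 = tilt(p, lam1)         # q_1 = p o I_1
lam12 = solve_mean(q1, F2)
q12 = tilt(q1, lam12)      # q_12 = q_1 o I_2
lam2 = solve_mean(p, F2)
q2 = tilt(p, lam2)         # q_2 = p o I_2

gap_42 = max(abs(a - b) for a, b in zip(q12, q2))
gap_43 = abs(lam2 - (lam1 + lam12))
gap_45 = abs(H(q2, p) - (H(q2, q1) + H(q1, p) + lam1 * (F1 - F2)))
print(gap_42, gap_43, gap_45)  # all three should be at rounding-error level
```

The posterior after $I_2$ is the same whether the prior is `p` or the intermediate posterior `q1`, and the multipliers add as in (43).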

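Proof of 14: From (A.4) we have

$$q_1(\mathbf{x}) = p(\mathbf{x}) \exp\Big(-\lambda_0^{(1)} - \sum_{k=1}^{m} \lambda_k^{(1)} a_k(\mathbf{x})\Big),$$

where the $\lambda_k^{(1)}$ are chosen to satisfy the constraints (40). Similarly,

$$q_{12}(\mathbf{x}) = q_1(\mathbf{x}) \exp\Big(-\lambda_0^{(12)} - \sum_{k=1}^{s} \lambda_k^{(12)} a_k(\mathbf{x})\Big)$$

holds. This is of the form $p(\mathbf{x}) \exp\big[-\lambda_0^{(2)} - \sum_{k=1}^{s} \lambda_k^{(2)} a_k(\mathbf{x})\big]$, with $\lambda_k^{(2)} = \lambda_k^{(1)} + \lambda_k^{(12)}$ $(k = 0, \ldots, m)$ and $\lambda_k^{(2)} = \lambda_k^{(12)}$ $(k = m+1, \ldots, s)$, and it is a probability density satisfying the constraints (41); it is therefore equal to $p \circ I_2 = q_2$, which proves (43)-(44). Eq. (45) follows from straightforward applications of (A.5).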


Property 15 (expected value matching): Let $I$ be the constraints

$$\int d\mathbf{x}\, q^\dagger(\mathbf{x}) f_k(\mathbf{x}) = \bar{f}_k \quad (k = 1, \ldots, m) \tag{46}$$

for a fixed set of functions $f_k$, and let $q = p \circ I$ be the result of taking this information into account. Then, for an arbitrary fixed density $q^* \in \mathcal{D}$, the cross-entropy $H[q^*,q] = H[q^*, p \circ I]$ has a minimum value when the constraints (46) satisfy

$$\bar{f}_k = \int d\mathbf{x}\, q^*(\mathbf{x}) f_k(\mathbf{x}).$$


Discussion: This property states that, for a density $q$ of the general form (A.4), $H[q^*,q]$ is smallest when the expectations of $q$ match those of $q^*$. In particular, note that $q = p \circ I$ is not only the density that minimizes $H[q,p]$, but also is the density of the form (A.4) that minimizes $H[q^\dagger,q]$! Property 15 is a generalization of a property of orthogonal polynomials [10] that, in the case of speech analysis, is called the "correlation matching property" [9, Chapter 2].
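A small numerical illustration of expected value matching follows. It is a sketch under assumptions: a six-state discrete system, one constraint function $f(x_j) = j$, an arbitrary positive density `q_star`, and hypothetical helper names. Among posteriors of the form (A.4), the one whose mean matches that of `q_star` should give a smaller $H[q^*, q]$ than nearby members of the same family.

```python
import math

def tilt(p, lam):
    """Density of the form (A.4): q_j proportional to p_j * exp(-lam * j)."""
    w = [pj * math.exp(-lam * j) for j, pj in enumerate(p, start=1)]
    z = sum(w)
    return [v / z for v in w]

def mean(q):
    return sum(j * qj for j, qj in enumerate(q, start=1))

def solve_mean(p, target, lo=-50.0, hi=50.0):
    """Bisect for the multiplier that gives the posterior the target mean."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(p, mid)) > target:  # the mean decreases as lam grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def H(q, p):
    """Cross-entropy H[q, p] = sum_j q_j log(q_j / p_j)."""
    return sum(qj * math.log(qj / pj) for qj, pj in zip(q, p))

p = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]             # prior
q_star = [0.05, 0.1, 0.15, 0.2, 0.25, 0.25]    # arbitrary fixed density

lam_match = solve_mean(p, mean(q_star))        # expectations of q match q*
h_match = H(q_star, tilt(p, lam_match))
h_below = H(q_star, tilt(p, lam_match - 0.1))  # mismatched expectations
h_above = H(q_star, tilt(p, lam_match + 0.1))
print(h_match, h_below, h_above)
```

Because $H[q^*, q]$ is convex in the multiplier for this family, the matched value is the smallest of the three.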

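Proof of 15: The cross-entropy $H[q^*,q]$ is given by

$$\begin{aligned}
H[q^*,q] &= \int d\mathbf{x}\, q^*(\mathbf{x}) \log\frac{q^*(\mathbf{x})}{q(\mathbf{x})} \\
&= \int d\mathbf{x}\, q^*(\mathbf{x}) \log\frac{q^*(\mathbf{x})}{p(\mathbf{x})} + \int d\mathbf{x}\, q^*(\mathbf{x}) \Big(\lambda_0 + \sum_k \lambda_k f_k(\mathbf{x})\Big) \\
&= \int d\mathbf{x}\, q^*(\mathbf{x}) \log\frac{q^*(\mathbf{x})}{p(\mathbf{x})} + \lambda_0 + \sum_k \lambda_k \int d\mathbf{x}\, q^*(\mathbf{x}) f_k(\mathbf{x}),
\end{aligned} \tag{47}$$

where we have used (A.4). Now, since the multipliers are functions of the expected values $\bar{f}_k$, variations in the expected values are equivalent to variations in the multipliers. Hence, to find the minimum of $H[q^*,q]$, we solve

$$\frac{\partial}{\partial \lambda_k} H[q^*,q] = 0 = \frac{\partial \lambda_0}{\partial \lambda_k} + \int d\mathbf{x}\, q^*(\mathbf{x}) f_k(\mathbf{x}),$$

where we have used (47). It follows from (A.9) that the minimum occurs when

$$\bar{f}_k = \int d\mathbf{x}\, q^*(\mathbf{x}) f_k(\mathbf{x}).$$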

V.  GENERAL  DISCUSSION 


Property 1 and Eqs. (12), (14), and (16) are the inference axioms on which the derivation in [4] is based. It is important to recognize that it is these inference properties, and not the corresponding cross-entropy properties (Eqs. (13), (15), and (17)), that characterize cross-entropy minimization. For more information on this distinction, see [4, Section VI] and [5].

An  interesting  aspect  of  the  results  presented  in  this  paper  is  the 
interplay  between  properties  of  cross-entropy  minimization  as  an  inference 
procedure  and  properties  of  cross-entropy  as  an  information  measure. 
Cross-entropy's  well-known  [1]  and  unique  [5]  properties  as  an  information 
measure  in  the  case  of  arbitrary  probability  densities  are  extended  and 
strengthened  when  one  of  the  densities  involved  is  the  result  of  cross-entropy 
minimization.  (See  the  statement  and  discussion  of  Properties  10,  11,  12,  13, 
and  15.)  Indeed,  the  resulting  combined  properties  have  led  to  a  new 
information-theoretic  method  of  pattern  analysis  and  classification  [11]  that 
is a refinement of a method due to Kullback [1, p. 83].

APPENDIX A


Mathematics of Cross-Entropy Minimization


We derive the general solution for cross-entropy minimization given arbitrary constraints, and we illustrate the result with the important cases of exponential and Gaussian densities. In general, however, it is difficult or impossible to obtain a closed-form, analytic solution expressed directly in terms of the known expected values rather than in terms of the Lagrangian multipliers. We therefore discuss a numerical technique for obtaining the solution, namely the Newton-Raphson method. This method is the basis for a computer program that solves for the minimum cross-entropy posterior given an arbitrary prior and arbitrary expected-value constraints.

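Given a positive prior density $p$ and a finite set of equality constraints

$$\int q(\mathbf{x})\, d\mathbf{x} = 1, \tag{A.1}$$

$$\int f_k(\mathbf{x}) q(\mathbf{x})\, d\mathbf{x} = \bar{f}_k, \quad (k = 1, \ldots, m), \tag{A.2}$$

we wish to find a density $q$ that minimizes

$$H[q,p] = \int q(\mathbf{x}) \log\frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x},$$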

subject to the constraints. For conditions that imply the existence of a unique minimum, see the discussion of Property 1 (uniqueness). One standard method for seeking the minimum is to introduce Lagrangian multipliers $\beta$ and $\lambda_k$ $(k = 1, \ldots, m)$ corresponding to the constraints, to form the expression

$$\int q(\mathbf{x}) \log\frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} + \beta \int q(\mathbf{x})\, d\mathbf{x} + \sum_{k=1}^{m} \lambda_k \int f_k(\mathbf{x}) q(\mathbf{x})\, d\mathbf{x},$$

and to equate the variation, with respect to $q$, of this quantity to zero:

$$\log\frac{q(\mathbf{x})}{p(\mathbf{x})} + 1 + \beta + \sum_{k=1}^{m} \lambda_k f_k(\mathbf{x}) = 0. \tag{A.3}$$

Solving for $q$ leads to

$$q(\mathbf{x}) = p(\mathbf{x}) \exp\Big(-\lambda_0 - \sum_{k=1}^{m} \lambda_k f_k(\mathbf{x})\Big), \tag{A.4}$$

where we have introduced $\lambda_0 = \beta + 1$.


In fact, the $q$, if it exists, that minimizes $H[q,p]$ has this form with the possible exception of a set $S$ of points on which the constraints imply that $q$ vanishes. (Such a situation would arise, for instance, if we had a constraint $\int q(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x} = 0$, where $f(\mathbf{x}) > 0$ when $\mathbf{x} \in S$ and $f(\mathbf{x}) = 0$ when $\mathbf{x} \notin S$. Informally, we could then imagine the Lagrangian multipliers becoming infinite in such a way that the argument of exp in (A.4) becomes $-\infty$ when $\mathbf{x} \in S$.) Conversely, if a density $q$ is found that is of this form and satisfies the constraints, then the minimum-cross-entropy density exists and equals $q$ [6], [1]. For simplicity in the following, we assume the set $S$ is empty.

The cross-entropy at the minimum can be expressed in terms of the $\lambda_k$ and the $\bar{f}_k$ by multiplying (A.3) by $q(\mathbf{x})$ and integrating. The result is

$$H[q,p] = -\lambda_0 - \sum_{k=1}^{m} \lambda_k \bar{f}_k. \tag{A.5}$$
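For readers who want the intermediate step, the following worked lines sketch how (A.5) follows from (A.3). They use only the normalization (A.1), the constraints (A.2), and the definition $\lambda_0 = \beta + 1$ from the text.

```latex
\begin{align*}
0 &= \int q(\mathbf{x}) \Big[ \log\frac{q(\mathbf{x})}{p(\mathbf{x})} + 1 + \beta
      + \sum_{k=1}^{m} \lambda_k f_k(\mathbf{x}) \Big]\, d\mathbf{x} \\
  &= H[q,p] + (1 + \beta) \int q(\mathbf{x})\, d\mathbf{x}
      + \sum_{k=1}^{m} \lambda_k \int f_k(\mathbf{x}) q(\mathbf{x})\, d\mathbf{x} \\
  &= H[q,p] + \lambda_0 + \sum_{k=1}^{m} \lambda_k \bar{f}_k .
\end{align*}
```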


It is necessary to choose $\lambda_0$ and the $\lambda_k$ so that the constraints are satisfied. In the presence of the constraint (A.1) we may rewrite the remaining constraints in the form

$$\int \big(f_k(\mathbf{x}) - \bar{f}_k\big) q(\mathbf{x})\, d\mathbf{x} = 0. \tag{A.6}$$


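Now, if we find values for the $\lambda_k$ such that

$$\int \big(f_i(\mathbf{x}) - \bar{f}_i\big) p(\mathbf{x}) \exp\Big(-\sum_{k=1}^{m} \lambda_k f_k(\mathbf{x})\Big)\, d\mathbf{x} = 0, \quad (i = 1, \ldots, m), \tag{A.7}$$

we are assured of satisfying (A.6); and we can then satisfy (A.1) by setting

$$\lambda_0 = \log \int p(\mathbf{x}) \exp\Big(-\sum_{k=1}^{m} \lambda_k f_k(\mathbf{x})\Big)\, d\mathbf{x}. \tag{A.8}$$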


If the integral in (A.8) can be performed, one can sometimes find values for the $\lambda_k$ from the relations

$$-\frac{\partial}{\partial \lambda_k} \lambda_0 = \bar{f}_k. \tag{A.9}$$
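The relation (A.9) can be checked numerically in the discrete case, where the integral in (A.8) becomes a sum. The sketch below is illustrative only: the four-state prior, the two constraint functions, the multiplier values, and the helper names are assumptions. It compares a central finite-difference estimate of $-\partial\lambda_0/\partial\lambda_k$ with the expectation of $f_k$ under the density (A.4).

```python
import math

p = [0.1, 0.2, 0.3, 0.4]             # prior probabilities p_j
f = [[0.0, 1.0, 2.0, 3.0],           # f_1(x_j)
     [1.0, 0.0, 1.0, 0.0]]           # f_2(x_j)

def weights(lam):
    """Unnormalized posterior p_j * exp(-sum_k lam_k f_k(x_j))."""
    return [pj * math.exp(-sum(l * fk[j] for l, fk in zip(lam, f)))
            for j, pj in enumerate(p)]

def lam0(lam):
    """Discrete analogue of (A.8): log of the normalizing sum."""
    return math.log(sum(weights(lam)))

def expectations(lam):
    """Expected values of the f_k under the density (A.4)."""
    w = weights(lam)
    z = sum(w)
    return [sum(wj * fk[j] for j, wj in enumerate(w)) / z for fk in f]

lam = [0.3, -0.7]                    # arbitrary multiplier values
h = 1e-6
derivs = []
for k in range(len(lam)):
    up = list(lam)
    dn = list(lam)
    up[k] += h
    dn[k] -= h
    derivs.append(-(lam0(up) - lam0(dn)) / (2 * h))  # -d(lam_0)/d(lam_k)

fbar = expectations(lam)
print(derivs, fbar)  # the two lists should agree to finite-difference accuracy
```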


The situation for inequality constraints is only slightly more complicated. Suppose we replace all the equal signs in (A.2) by $\leq$. (We lose no generality thereby: we can change inequalities with $\geq$ into inequalities with $\leq$ by changing the signs of the corresponding $f_k$ and $\bar{f}_k$, and an equality constraint is equivalent to a pair of inequality constraints.) The $q$ that minimizes $H[q,p]$ subject to the resulting constraints will in general satisfy equality for certain values of $k$ in the modified (A.2), while strict inequality will hold for the rest. We can still use the solution (A.4), subjecting the Lagrange multipliers to the conditions $\lambda_k \geq 0$ for $k$ such that equality holds in the constraint, and $\lambda_k = 0$ for $k$ such that strict inequality holds in the constraint.
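The sign conditions on the multipliers can be illustrated with a single inequality constraint on the mean of a discrete system. This is a minimal sketch under assumptions: six states, the constraint function $f(x_j) = j$, and hypothetical helper names. If the prior already satisfies the bound, the multiplier is zero and the posterior equals the prior; otherwise the constraint holds with equality and the multiplier is positive.

```python
import math

def tilt(p, lam):
    """Solution form (A.4): q_j proportional to p_j * exp(-lam * j)."""
    w = [pj * math.exp(-lam * j) for j, pj in enumerate(p, start=1)]
    z = sum(w)
    return [v / z for v in w]

def mean(q):
    return sum(j * qj for j, qj in enumerate(q, start=1))

def solve_mean(p, target, lo=-50.0, hi=50.0):
    """Bisect for the multiplier that gives the posterior the target mean."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(p, mid)) > target:  # the mean decreases as lam grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def posterior_mean_at_most(p, bound):
    """Minimum cross-entropy posterior subject to E_q[j] <= bound."""
    if mean(p) <= bound:
        return list(p), 0.0          # strict inequality (or prior on the bound)
    lam = solve_mean(p, bound)       # equality holds, multiplier is positive
    return tilt(p, lam), lam

p = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]  # prior with mean 3.3
q_loose, lam_loose = posterior_mean_at_most(p, 4.0)
q_tight, lam_tight = posterior_mean_at_most(p, 2.5)
print(lam_loose, lam_tight, mean(q_tight))
```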


It unfortunately is usually impossible to solve (A.7) or (A.9) for the $\lambda_k$ explicitly, in closed form; however, it is possible in certain important special cases. For example, consider the case in which the prior $p(\mathbf{x})$ is a multivariate exponential,

$$p(\mathbf{x}) = \prod_{k=1}^{n} (1/a_k) \exp[-x_k/a_k], \tag{A.10}$$

where $\mathbf{x} = (x_1, \ldots, x_n)$ and the $x_k$ each range over the positive real line, and in which the constraints are

$$\int d\mathbf{x}\, x_k\, q(\mathbf{x}) = \bar{x}_k, \tag{A.11}$$

$k = 1, \ldots, n$. Solving (A.9) in order to express the minimum cross-entropy posterior directly in terms of the known expected values $\bar{x}_k$ yields

$$q(\mathbf{x}) = \prod_{k} (1/\bar{x}_k) \exp[-x_k/\bar{x}_k]. \tag{A.12}$$

Thus, the density remains multivariate exponential, with the prior mean values $a_k$ being replaced by the newly learned values $\bar{x}_k$.
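The steps behind this result can be sketched as follows, assuming the prior is normalized as $\prod_k (1/a_k)\exp[-x_k/a_k]$ and that $1 + a_k \lambda_k > 0$ so that the integrals converge. The first line evaluates (A.8), the second applies (A.9), and the third substitutes the multipliers into (A.4).

```latex
\begin{align*}
\lambda_0 &= \log \prod_{k=1}^{n} \frac{1}{a_k}
             \int_0^\infty e^{-x_k (1/a_k + \lambda_k)}\, dx_k
           = -\sum_{k=1}^{n} \log(1 + a_k \lambda_k), \\
-\frac{\partial \lambda_0}{\partial \lambda_k}
          &= \frac{a_k}{1 + a_k \lambda_k} = \bar{x}_k
             \quad\Longrightarrow\quad
             \lambda_k = \frac{1}{\bar{x}_k} - \frac{1}{a_k}, \\
q(\mathbf{x}) &= p(\mathbf{x}) \exp\Big(-\lambda_0 - \sum_{k=1}^{n} \lambda_k x_k\Big)
           = \prod_{k=1}^{n} \frac{1 + a_k \lambda_k}{a_k}\,
             e^{-x_k (1/a_k + \lambda_k)}
           = \prod_{k=1}^{n} \frac{1}{\bar{x}_k}\, e^{-x_k/\bar{x}_k}.
\end{align*}
```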

Now consider the case in which the $x_k$ range over the entire real line, and in which the prior density is Gaussian,

$$p(\mathbf{x}) = \prod_{k} (2\pi b_k)^{-1/2} \exp[-(x_k - a_k)^2 / 2 b_k].$$

Suppose that the constraints are (A.11) and

$$\int d\mathbf{x}\, (x_k - \bar{x}_k)^2 q(\mathbf{x}) = v_k.$$

In this case the minimum cross-entropy posterior is

$$q(\mathbf{x}) = \prod_{k} (2\pi v_k)^{-1/2} \exp[-(x_k - \bar{x}_k)^2 / 2 v_k].$$

Thus, the density remains multivariate Gaussian, with the prior means and variances being replaced by the newly learned values.

Here is an example of a simple problem for which the solution of (A.7) cannot be expressed in closed form. Consider a discrete system with $n$ states $x_j$ and prior probabilities $p(x_j) = p_j$ $(j = 1, \ldots, n)$. The discrete form of (A.1) is

$$\sum_{j=1}^{n} q_j = 1, \tag{A.13}$$

where $q_j = q(x_j)$. Suppose the only other constraint is that the mean $m$ of the indices $j$ is prescribed: $f(x_j) = j$, and

$$\sum_{j=1}^{n} j\, q_j = m. \tag{A.14}$$

Then (A.4) becomes $q_j = p_j \exp(-\lambda_0 - \lambda j)$, which we write as $q_j = a\, p_j z^j$ by introducing the abbreviations $a = \exp[-\lambda_0]$ and $z = \exp[-\lambda]$. From (A.13) and (A.14) we then obtain

$$a = \Big(\sum_{j=1}^{n} p_j z^j\Big)^{-1}, \qquad \sum_{j=1}^{n} (j - m)\, p_j z^j = 0. \tag{A.15}$$

The problem then reduces to finding a positive root of the polynomial in (A.15). As in the continuous case, there are special forms for the prior that lead to important particular solutions. But when $n > 5$, the roots of the polynomial (other than zero) cannot in general be written as explicit, closed-form expressions in the coefficients. For arbitrary priors, numerical methods of solution therefore become important. Our obtaining a polynomial equation in the present example was an accidental consequence of the fact that the values of the constraint function $f$ formed a subset of an arithmetic progression $(j = 1, 2, \ldots)$. Thus, for more general types of problems, numerical methods are even more important.
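To make the discrete example concrete, the sketch below finds the positive root of the polynomial in (A.15) numerically and builds the posterior $q_j = a\, p_j z^j$. The uniform six-state prior, the prescribed mean of 4.5, the helper names, and the use of bisection on a logarithmic scale are all assumptions made for illustration. For $z > 0$ the polynomial equals $\big(\sum_j p_j z^j\big)$ times (posterior mean minus $m$), so it changes sign exactly once, which is what makes bisection applicable.

```python
import math

def poly(z, p, m):
    """Left-hand side of (A.15): sum_j (j - m) p_j z^j."""
    return sum((j - m) * pj * z ** j for j, pj in enumerate(p, start=1))

def positive_root(p, m, lo=1e-6, hi=1e6):
    """Bisect on a logarithmic scale for the positive root of (A.15)."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if poly(mid, p, m) < 0:      # posterior mean still below m
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def posterior(p, m):
    """Posterior q_j = a p_j z^j with a fixed by normalization (A.13)."""
    z = positive_root(p, m)
    w = [pj * z ** j for j, pj in enumerate(p, start=1)]
    a = 1.0 / sum(w)
    return [a * v for v in w], z

p = [1.0 / 6.0] * 6                  # uniform prior over six states
q, z = posterior(p, 4.5)             # prescribed mean m = 4.5
q_mean = sum(j * qj for j, qj in enumerate(q, start=1))
print(z, q, q_mean)
```

Since the prescribed mean exceeds the prior mean of 3.5, the root satisfies $z > 1$ and the posterior probabilities increase with $j$.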

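One such method is the Newton-Raphson method, which is for finding solutions for systems of equations that, like (A.7), are of the form

$$F_i(\lambda_1, \ldots, \lambda_m) = 0 \quad (i = 1, \ldots, m). \tag{A.16}$$

The method starts with an initial guess at the solution, $\boldsymbol{\lambda}^{(0)} = (\lambda_1^{(0)}, \ldots, \lambda_m^{(0)})$, and produces further approximate solutions $\boldsymbol{\lambda}^{(1)}, \boldsymbol{\lambda}^{(2)}, \ldots$ in succession. If the initial guess $\boldsymbol{\lambda}^{(0)}$ is close enough to a solution of (A.16), if the $F_i$ are continuously differentiable, and if the Jacobian $(\partial F_i / \partial \lambda_k)$ is nonsingular, then the $\boldsymbol{\lambda}^{(r)}$ will converge to the solution in the limit as $r \to \infty$.

The method is based on the fact that, for small changes $\Delta\boldsymbol{\lambda}^{(r)}$ in the arguments, we have the approximate equality

$$F_i(\boldsymbol{\lambda}^{(r)} + \Delta\boldsymbol{\lambda}^{(r)}) \approx F_i(\boldsymbol{\lambda}^{(r)}) + \sum_{k=1}^{m} \frac{\partial F_i(\boldsymbol{\lambda}^{(r)})}{\partial \lambda_k} \Delta\lambda_k^{(r)},$$

up to a term of order $o(\Delta\boldsymbol{\lambda}^{(r)})$. We therefore take $\Delta\boldsymbol{\lambda}^{(r)}$ to be a solution of the linear equation

$$\sum_{k=1}^{m} \frac{\partial F_i(\boldsymbol{\lambda}^{(r)})}{\partial \lambda_k} \Delta\lambda_k^{(r)} = -F_i(\boldsymbol{\lambda}^{(r)}) \tag{A.17}$$

and set $\boldsymbol{\lambda}^{(r+1)} = \boldsymbol{\lambda}^{(r)} + \Delta\boldsymbol{\lambda}^{(r)}$. In applying the Newton-Raphson method to cross-entropy minimization, we let $F_i(\boldsymbol{\lambda})$ be proportional to the discrete form of the left-hand side of (A.7); we set

$$F_i(\boldsymbol{\lambda}^{(r)}) = \sum_{j=1}^{n} f_{ij}\, p_j \exp\Big(-\sum_{u=1}^{m} \lambda_u^{(r)} f_{uj}\Big), \tag{A.18}$$

$$\frac{\partial F_i(\boldsymbol{\lambda}^{(r)})}{\partial \lambda_k} = -\sum_{j=1}^{n} f_{ij} f_{kj}\, p_j \exp\Big(-\sum_{u=1}^{m} \lambda_u^{(r)} f_{uj}\Big), \tag{A.19}$$

where $f_{ij} = f_i(x_j) - \bar{f}_i$, and we have removed a factor of $\exp[-\sum_u \lambda_u \bar{f}_u]$. With the abbreviation

$$g_j = \Big[p_j \exp\Big(-\sum_{u=1}^{m} \lambda_u^{(r)} f_{uj}\Big)\Big]^{1/2},$$

we express the right-hand sides of (A.18) and (A.19) in matrix notation as $[\mathbf{f}\, \mathrm{diag}(\mathbf{g})\, \mathbf{g}]_i$ and $[-\mathbf{f}\, \mathrm{diag}(\mathbf{g})^2\, \mathbf{f}^t]_{ik}$, respectively, where $\mathrm{diag}(\mathbf{g})$ is the diagonal matrix whose diagonal elements are the $g_j$, and $\mathbf{f}^t$ is the transpose of $\mathbf{f}$. The solution of (A.17) is then given by

$$\Delta\boldsymbol{\lambda}^{(r)} = \big[(\mathbf{f}\, \mathrm{diag}(\mathbf{g})^2\, \mathbf{f}^t)^{-1}\, \mathbf{f}\, \mathrm{diag}(\mathbf{g})\big]\, \mathbf{g}.$$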


We remark that the quantity in brackets is the Moore-Penrose generalized inverse [12] of the matrix $\mathrm{diag}(\mathbf{g})\, \mathbf{f}^t$. The approach just described has been made the basis for a computer program [13], written in APL, for solving cross-entropy minimization problems with arbitrary positive discrete priors $p$ and equality constraints specified by matrices $\mathbf{f}$. The approach is particularly convenient for programming in APL since the generalized inverse is a built-in APL primitive function [14]. To solve a minimum-cross-entropy problem with 500 states and 10 constraints, the program typically requires 15 seconds of CPU time when running under the APL SF interpreter on a DEC-10 system with a KI central processor.
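The APL program of [13] is not reproduced here. As a rough illustration of the iteration (A.16)-(A.19), the following Python sketch solves the same kind of discrete problem. The function names, the example prior and constraints, and the small Gauss-Jordan routine are assumptions. The step-halving safeguard is an addition for robustness and is not described in the text; it relies on the fact that the $F_i$ are minus the gradient of the convex function $\sum_j p_j \exp(-\sum_u \lambda_u f_{uj})$.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                t = m[r][c] / m[c][c]
                m[r] = [x - t * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def newton_mce(p, f, fbar, iters=50):
    """Newton-Raphson for the multipliers; returns (posterior q, multipliers)."""
    m, n = len(f), len(p)
    c = [[f[i][j] - fbar[i] for j in range(n)] for i in range(m)]  # f_ij

    def g2(lam):  # g_j squared: p_j exp(-sum_u lam_u f_uj)
        return [p[j] * math.exp(-sum(lam[u] * c[u][j] for u in range(m)))
                for j in range(n)]

    lam = [0.0] * m
    for _ in range(iters):
        w = g2(lam)
        F = [sum(c[i][j] * w[j] for j in range(n)) for i in range(m)]   # (A.18)
        A = [[sum(c[i][j] * c[k][j] * w[j] for j in range(n))           # minus (A.19)
              for k in range(m)] for i in range(m)]
        step = solve_linear(A, F)     # (A.17) with Jacobian -A
        t = 1.0                       # safeguard (assumption): halve the step
        while t > 1e-12 and sum(g2([a + t * s for a, s in zip(lam, step)])) > sum(w):
            t /= 2
        lam = [a + t * s for a, s in zip(lam, step)]
    w = g2(lam)
    z = sum(w)
    return [v / z for v in w], lam

x = [1.0, 2.0, 3.0, 4.0]
p = [0.4, 0.3, 0.2, 0.1]
f = [x, [v * v for v in x]]           # constraint functions x and x^2
fbar = [2.5, 7.5]                     # prescribed expected values

q, lam = newton_mce(p, f, fbar)
moments = [sum(fk[j] * q[j] for j in range(len(q))) for fk in f]
print(q, lam, moments)
```

In an array language the linear solve in (A.17) can be expressed through a least-squares or generalized-inverse primitive, which is the convenience the text attributes to APL.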

Gokhale and Kullback [15] describe a somewhat different algorithm, also based on the Newton-Raphson method, that has been implemented in PL/I. Agmon, Alhassid, and Levine [16], [17] describe yet another cross-entropy minimization algorithm and a FORTRAN implementation. Tribus [7] presents programs in BASIC that compute singly and doubly truncated Gaussian distributions as maximum entropy distributions with prescribed means and variances.


APPENDIX B


Remark on the Discussion of Property 12


In the discussion of Property 12, it was stated that for any prior $p$ and any density $r \in \mathcal{D}$ with $H[r,p] < \infty$, there exists a finite set of equality constraints $I_r$ such that $r = p \circ I_r$. In fact, at most two are needed. Let

$$f_1(\mathbf{x}) = \begin{cases} 0, & r(\mathbf{x}) \neq 0 \\ 1, & r(\mathbf{x}) = 0, \end{cases} \qquad \bar{f}_1 = 0,$$

$$f_2(\mathbf{x}) = \begin{cases} \log(p(\mathbf{x})/r(\mathbf{x})), & r(\mathbf{x}) \neq 0 \\ 0, & r(\mathbf{x}) = 0, \end{cases} \qquad \bar{f}_2 = -H[r,p],$$

and impose constraints

$$\int q(\mathbf{x}) f_1(\mathbf{x})\, d\mathbf{x} = \bar{f}_1, \tag{B.1}$$

$$\int q(\mathbf{x}) f_2(\mathbf{x})\, d\mathbf{x} = \bar{f}_2. \tag{B.2}$$

The first constraint implies $(p \circ I_r)(\mathbf{x}) = 0$ where $r(\mathbf{x}) = 0$. On the complementary set, where $r(\mathbf{x}) \neq 0$, define $q(\mathbf{x})$ by (A.4) with all $\lambda_j = 0$ except $\lambda_2 = 1$; this gives a function $q$ that satisfies the second constraint as well as the first and also agrees with $r$. Hence $r = q$ is the result of minimizing $H[q,p]$ subject to (B.1) and (B.2).